84 research outputs found
Seeing What You're Told: Sentence-Guided Activity Recognition In Video
We present a system that demonstrates how the compositional structure of
events, in concert with the compositional structure of language, can interplay
with the underlying focusing mechanisms in video action recognition, thereby
providing a medium not only for top-down and bottom-up integration, but also
for multi-modal integration between vision and language. We show how the roles
played by participants (nouns), their characteristics (adjectives), the actions
performed (verbs), the manner of such actions (adverbs), and changing spatial
relations between participants (prepositions) in the form of whole sentential
descriptions mediated by a grammar, guide the activity-recognition process.
Further, the utility and expressiveness of our framework are demonstrated by
performing three separate tasks in the domain of multi-activity videos:
sentence-guided focus of attention, generation of sentential descriptions of
video, and query-based video search, simply by leveraging the framework in
different manners.
Comment: To appear in CVPR 201
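A minimal sketch of the compositional scoring idea: each word of a parsed sentence scores how well candidate object tracks fill its semantic roles, and the sentence score is the best joint role binding. The paper's actual system uses word-specific models solved jointly with tracking; everything below (the toy word model, the tuple encoding of roles, the track ids) is an illustrative assumption.

    import itertools
    import math
    import random

    # Toy stand-in for a learned per-word model (deterministic
    # pseudo-random scores keyed on the word and its role binding).
    def word_log_score(word, assigned_tracks):
        rng = random.Random(repr((word, assigned_tracks)))
        return math.log(rng.uniform(0.01, 1.0))

    # A parsed sentence as (word, role slots) pairs, e.g. from
    # "the person rode the horse".
    sentence = [("person", ("agent",)),
                ("rode",   ("agent", "patient")),
                ("horse",  ("patient",))]

    def sentence_log_score(sentence, track_ids):
        slots = sorted({s for _, roles in sentence for s in roles})
        best = float("-inf")
        # Try every binding of candidate tracks to the sentence's roles.
        for choice in itertools.product(track_ids, repeat=len(slots)):
            binding = dict(zip(slots, choice))
            score = sum(word_log_score(w, tuple(binding[s] for s in roles))
                        for w, roles in sentence)
            best = max(best, score)
        return best

    print(sentence_log_score(sentence, ["t0", "t1", "t2"]))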
Saying What You're Looking For: Linguistics Meets Video Search
We present an approach to searching large video corpora for video clips which
depict a natural-language query in the form of a sentence. This approach uses
compositional semantics to encode subtle meaning that is lost in other systems,
such as the difference between two sentences which have identical words but
entirely different meaning: "The person rode the horse" vs. "The horse
rode the person". Given a video-sentence pair and a natural-language parser,
along with a grammar that describes the space of sentential queries, we produce
a score which indicates how well the video depicts the sentence. We produce
such a score for each video clip in a corpus and return a ranked list of clips.
Furthermore, this approach addresses two fundamental problems simultaneously:
detecting and tracking objects, and recognizing whether those tracks depict the
query. Because both tracking and object detection are unreliable, the approach
uses knowledge of the intended sentential query to focus the tracker on the
relevant participants and to ensure that the resulting tracks are described by
the sentential query. While earlier work was limited to single-word queries
which correspond to either verbs or nouns, we show how one can search for
complex queries which contain multiple phrases, such as prepositional phrases,
and modifiers, such as adverbs. We demonstrate this approach by searching for
141 queries involving people and horses interacting with each other in 10
full-length Hollywood movies.
Comment: 13 pages, 8 figures
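Given such a video-sentence scorer, the retrieval loop itself is straightforward; a hedged sketch, where the scorer interface and the clip representation are assumptions rather than the paper's API:

    # Score every clip in the corpus against the sentential query and
    # return clip ids ranked best-first.
    def rank_corpus(clips, query, scorer):
        """clips: iterable of (clip_id, tracks) pairs."""
        scored = sorted(((scorer(query, tracks), cid)
                         for cid, tracks in clips), reverse=True)
        return [cid for _, cid in scored]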
Do You See What I Mean? Visual Resolution of Linguistic Ambiguities
Understanding language goes hand in hand with the ability to integrate
complex contextual information obtained via perception. In this work, we
present a novel task for grounded language understanding: disambiguating a
sentence given a visual scene which depicts one of the possible interpretations
of that sentence. To this end, we introduce a new multimodal corpus containing
ambiguous sentences, representing a wide range of syntactic, semantic and
discourse ambiguities, coupled with videos that visualize the different
interpretations for each sentence. We address this task by extending a vision
model which determines if a sentence is depicted by a video. We demonstrate how
such a model can be adjusted to recognize different interpretations of the same
underlying sentence, allowing it to disambiguate sentences in a unified
fashion across the different ambiguity types.
Comment: EMNLP 201
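Read as a scoring problem, disambiguation reduces to ranking the interpretations of one sentence by how well the video depicts each; a schematic sketch under that assumption (the scorer and the parsed-reading representation are illustrative):

    # Pick the reading of an ambiguous sentence that the video depicts
    # best, reusing one video-sentence scorer across all readings.
    def disambiguate(readings, video_tracks, scorer):
        return max(readings, key=lambda r: scorer(r, video_tracks))

Syntactic, semantic, and discourse ambiguities then differ only in how the candidate readings are produced, not in how they are scored.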
Measuring Social Biases in Grounded Vision and Language Embeddings
We generalize the notion of social biases from language embeddings to
grounded vision and language embeddings. Biases are present in grounded
embeddings, and indeed appear to be at least as significant as those in
ungrounded embeddings. This is despite the fact that vision and language can
suffer from different biases, which one might hope could attenuate the biases
in both. Multiple ways exist to generalize metrics measuring bias in word
embeddings to this new setting. We introduce the space of generalizations
(Grounded-WEAT and Grounded-SEAT) and demonstrate that three generalizations
answer different yet important questions about how biases, language, and vision
interact. These metrics are used on a new dataset, the first for grounded bias,
created by extending standard linguistic bias benchmarks with 10,228
images from COCO, Conceptual Captions, and Google Images. Dataset construction
is challenging because vision datasets are themselves very biased. The presence
of these biases in systems will begin to have real-world consequences as they
are deployed, making carefully measuring bias and then mitigating it critical
to building a fair society.
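For reference, the ungrounded WEAT effect size that these metrics generalize can be written in a few lines; in the grounded variants the embedding function additionally conditions on an image, which this sketch omits:

    import numpy as np

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # s(w, A, B): mean cosine similarity of w to attribute set A minus
    # its mean similarity to attribute set B.
    def association(w, A, B, emb):
        return (np.mean([cos(emb[w], emb[a]) for a in A])
                - np.mean([cos(emb[w], emb[b]) for b in B]))

    # Effect size over target sets X, Y (e.g. flowers vs. insects) and
    # attribute sets A, B (e.g. pleasant vs. unpleasant words).
    def weat_effect_size(X, Y, A, B, emb):
        sx = [association(x, A, B, emb) for x in X]
        sy = [association(y, A, B, emb) for y in Y]
        return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)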
Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of LTL formulas
We demonstrate a reinforcement learning agent which uses a compositional
recurrent neural network that takes as input an LTL formula and determines
satisfying actions. The input LTL formulas have never been seen before, yet the
network performs zero-shot generalization to satisfy them. This is a novel form
of multi-task learning for RL agents where agents learn from one diverse set of
tasks and generalize to a new set of diverse tasks. The formulation of the
network enables this capacity to generalize. We demonstrate this ability in two
domains. In a symbolic domain, the agent finds a sequence of letters that is
accepted. In a Minecraft-like environment, the agent finds a sequence of
actions that conform to the formula. While prior work could learn to execute
one formula reliably given examples of that formula, we demonstrate how to
encode all formulas reliably. This could form the basis of new multitask agents
that discover sub-tasks and execute them without any additional training, as
well as agents that follow more complex linguistic commands. The
structures required for this generalization are specific to LTL formulas, which
opens up an interesting theoretical question: what structures are required in
neural networks for zero-shot generalization to different logics?
Comment: Accepted in IROS 202
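One way to picture such a compositional network (an illustrative guess at the shape, not the paper's exact cells; in particular, recurrence over time is omitted here) is a small learned block per LTL operator, shared across formulas and composed along each formula's parse tree, so an unseen formula instantiates a new network from familiar parts:

    import torch
    import torch.nn as nn

    OBS, HID, ACTS = 16, 32, 4

    # One learned block per operator, reused by every formula.
    ops = nn.ModuleDict({
        "until": nn.Linear(2 * HID + OBS, HID),
        "and":   nn.Linear(2 * HID + OBS, HID),
        "not":   nn.Linear(HID + OBS, HID),
        "prop":  nn.Linear(OBS, HID),   # one per proposition in practice
    })
    policy = nn.Linear(HID, ACTS)

    def encode(formula, obs):
        """formula: nested tuples, e.g. ("until", ("prop",), ("not", ("prop",)))."""
        op, *children = formula
        if op == "prop":
            return torch.relu(ops["prop"](obs))
        kids = [encode(c, obs) for c in children]
        return torch.relu(ops[op](torch.cat(kids + [obs], dim=-1)))

    obs = torch.zeros(OBS)
    logits = policy(encode(("until", ("not", ("prop",)), ("prop",)), obs))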
Deep compositional robotic planners that follow natural language commands
We demonstrate how a sampling-based robotic planner can be augmented to learn
to understand a sequence of natural language commands in a continuous
configuration space to move and manipulate objects. Our approach combines a
deep network structured according to the parse of a complex command that
includes objects, verbs, spatial relations, and attributes, with a
sampling-based planner, RRT. A recurrent hierarchical deep network controls how
the planner explores the environment, determines when a planned path is likely
to achieve a goal, and estimates the confidence of each move to trade off
exploitation and exploration between the network and the planner. Planners are
designed to have near-optimal behavior when information about the task is
missing, while networks learn to exploit observations which are available from
the environment, making the two naturally complementary. Combining the two
enables generalization to new maps, new kinds of obstacles, and more complex
sentences that do not occur in the training set. Little data is required to
train the model, even though it jointly acquires a CNN that extracts features
from the environment while learning the meanings of words. Although it is an
end-to-end model, it provides a level of interpretability through attention
maps that allow users to see its reasoning steps. This enables robots to learn
to follow natural language commands in challenging continuous environments.
Comment: Accepted in ICRA 202
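A schematic of the exploitation-exploration trade-off between network and planner described above (all names are illustrative stand-ins, and the paper's gating is learned rather than this simple coin flip):

    import random

    # One RRT expansion: follow the command-conditioned network's
    # proposal when it is confident, otherwise fall back to the
    # planner's uniform sampling of the configuration space.
    def rrt_step(tree, net_propose, net_confidence, sample_free, extend):
        if random.random() < net_confidence(tree):
            target = net_propose(tree)     # exploit the learned model
        else:
            target = sample_free()         # planner-style exploration
        return extend(tree, target)        # standard RRT extend step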